Jess Goddard, Rich Pauloo, Ryan Shepherd, Noor Brody

Last updated 2022-04-15 16:25:03


Introduction

In this report, we present a nationwide water service spatial boundary layer for community water systems and explain the tiered approach taken towards this end. The water service boundary layer produced in this report should not be interpreted as a final product, but rather, one that may be improved with the assimilation of additional water service boundaries from states and improvements to the methodology.

This report builds on a previous technical memorandum (eda_february.html) that should be read as prerequisite material to understand the broader context, background, and exploratory data analysis (EDA) that informs the approach taken herein. The resulting national water service boundary layer is the product of a “Tiered Explicit, Match, and Model” (henceforth, TEMM) approach. The TEMM is composed of three hierarchical tiers, arranged by data and model fidelity.

  1. Whenever present, we use explicit water service boundaries (labeled data). These are spatial polygon data, typically provided at the state-level.
  2. In the absence of explicit water service boundary data, we use a matching algorithm to match PWSIDs by spatial intersection and name to TIGER place polygons. When a water system and TIGER place match one-to-one, we label this Tier 2a. When multiple water systems match to the same TIGER place, we label this Tier 2b; Tier 2b reflects overlapping boundaries for multiple systems, which require future validation. We validate this heuristic approach on explicit water service boundaries1, and find favorable results consistent with findings by Patterson et al. (2022), who showed that Census TIGER Places generally agree with municipal boundaries.
  3. Finally, in the absence of an explicit water service boundary (Tier 1) or a TIGER place polygon match (Tier 2a or Tier 2b), a statistical model trained on explicit water service boundary data (Tier 1) is used to estimate a reasonable radius at provided PWSID centroids and model a circular water system boundary (Tier 3). Centroid accuracy issues that remain intractable at this time are discussed in the Limitations and Recommendations sections.
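For intuition, the Tier 3 step can be sketched in a few lines of sf code: buffer each centroid by its modeled radius. The PWSIDs, coordinates, and `radius_m` column below are invented for illustration; the real pipeline predicts radii with the Tier 3 model described later in this report.

```r
library(sf)

# hypothetical centroids with modeled radii in meters (all values invented)
pts <- st_as_sf(
  data.frame(
    pwsid    = c("XX0000001", "XX0000002"),
    radius_m = c(1500, 800),
    lon      = c(-121.5, -121.7),
    lat      = c(38.6, 38.4)
  ),
  coords = c("lon", "lat"), crs = 4326
)

# buffer in a metric CRS so the modeled radius is honored in meters,
# then return to geographic coordinates
tier3 <- st_transform(
  st_buffer(st_transform(pts, 3310), dist = pts$radius_m),
  4326
)
```

Note that `st_buffer()` accepts a vector of distances, so each system receives its own modeled radius in a single call.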

Below is a conceptual diagram of the three tiers in the TEMM approach. Tier 1 explicit boundaries always supersede Tier 2 matched proxy boundaries, which in turn always supersede Tier 3 modeled boundaries. Thus, the resulting water service boundary layer described in this report is a combination of all three tiers, and depends on data availability for each water system and whether or not it matches to a TIGER place.

Fig 1. Conceptual diagram of water service boundary tiers


In the sections that follow, we summarize our approach for each of the three tiers. As we move from Tier 1 to Tier 3, uncertainty increases; hence, we describe each Tier in increasing detail and provide measures of validation and uncertainty for Tier 2 and Tier 3 estimates.

Finally, we encourage the reader to consider the TEMM water service boundary layer not as a final product, but rather, one that may be improved with the assimilation of additional Tier 1 explicit data, improvements to the Tier 2 matching algorithm, and refinement of the Tier 3 model. Ultimately, this entire workflow may be superseded by nationwide Tier 1 data, which would reduce the problem we address in the scope of work to one of simple Tier 1 data assimilation and cleaning.


Data Sources

Table 1. Data sources for TEMM layer.

Data Source | Description | Link
EPA Safe Drinking Water Information System | Public water systems “master list” with key attribute data and some tabular geographic data, like cities served | SDWIS MODEL
EPA Enforcement and Compliance History Online | Public water system facilities archive, drawing lat/long data for facility centroids from FRS | ECHO Exporter
EPA Facilities Registry Service | FRS regularly updates facilities data with lat/long information, which pipes into ECHO | FRS Geospatial
US Census Bureau TIGER/Line (also called “TIGER Places”) | US Census data on places (cities and towns) used to identify potential service area boundaries | TIGER/Line Shapefiles, accessed using the R package tigris
Unregulated Contaminant Monitoring Rule | UCMR 3 and UCMR 4 provide data on pwsid and zip codes served, which can provide higher-integrity centroids where needed | UCMR Occurrence data
Homeland Infrastructure Foundation-Level Data | Mobile home park centroids | MHP centroids
Labeled Water System Boundaries | URLs on state pages for various water service boundary sources | State URL tracking doc


Report outline

library(tidyverse)
library(sf)
library(fs)
library(rcartocolor)
library(geofacet)
library(mapview)

# don't allow fgb streaming: delivers self-contained html
mapviewOptions(fgb = FALSE)

# load environmental variable for staging path 
staging_path <- Sys.getenv("WSB_STAGING_PATH")
output_path <- Sys.getenv("WSB_OUTPUT_PATH")

# read TEMM spatial output for key result summary stats below
temm <- dir_ls(path(output_path, "temm_layer/"), glob = "*geojson") %>% 
  # get the most up to date temm spatial geojson layer if there are many
  sort() %>% rev() %>% .[1] %>% 
  st_read(quiet = TRUE) %>% 
  # drop this column because it's read as a list and causes mapview trouble
  select(-service_area_type_code)

# Tier 1 labeled data
wsb_labeled_clean <- path(staging_path, "wsb_labeled_clean.geojson") %>% 
  st_read(quiet = TRUE)

# read the clean matched output and perform minor transforms for plots.
# this has the same number of rows as `temm`, but contains Tier 1 "radius"
d <- read_csv(path(staging_path, "matched_output_clean.csv")) %>%
  mutate(
    # transform radius from m to km
    radius = radius/1000,
    # indicate the tier to use for each pwsid
    tier = case_when(
      has_labeled_bound == TRUE ~ "Tier 1",
      has_labeled_bound == FALSE & tiger_to_pws_match_count == 1 ~ "Tier 2a",
      has_labeled_bound == FALSE & tiger_to_pws_match_count >  1 ~ "Tier 2b",
      has_labeled_bound == FALSE & is.na(tiger_match_geoid)  ~ "Tier 3"
    )
  ) 

# total water system count
nrow_d <- nrow(d)

# population served by all water systems
pop_total <- sum(d$population_served_count)

# calculate count and proportion of people served by each tier
pop <- d %>% 
  group_by(tier) %>% 
  summarize(
    count = format_bign(sum(population_served_count)),
    prop  = round((sum(population_served_count)/pop_total)*100, 2)
  )

# calculate count and proportion of water systems by each tier
sys <- d %>% 
  group_by(tier) %>% 
  summarize(
    count = n(),
    prop  = round((sum(count)/nrow_d)*100, 2)
  )

# read in sdwis original set of water systems for comparisons
acws <- read_csv(path(staging_path, "sdwis_water_system.csv")) %>%
  filter(pws_activity_code == "A" &
           pws_type_code == "CWS")

# number of active, community water systems in SDWIS
n_acws <- acws %>% nrow()

# population served by active, community water systems in SDWIS
pop_acws <- sum(acws$population_served_count)

The following outline reflects a summary of key findings, followed by a description of each Tier in the TEMM approach. Notably, the TEMM may be refined and re-run as new data sources are ingested or improvements are made to matching and modeling algorithms.

Key Results

  • A nationwide water service boundary layer is provided for 44,919 community water systems2 covering a total population of 306,876,850.
  • Population coverage rates per Tier:
    • Tier 1: 39.75% population covered (121,993,956 people)
    • Tier 2a: 22.41% population covered (68,758,215 people)
    • Tier 2b: 30.05% population covered (92,229,951 people)
    • Tier 3: 7.79% population covered (23,894,728 people)
  • CWS coverage rates per Tier:
    • Tier 1: 32.52% of systems covered (14,607 systems)
    • Tier 2a: 21.12% of systems covered (9,488 systems)
    • Tier 2b: 22.49% of systems covered (10,104 systems)
    • Tier 3: 23.87% of systems covered (10,720 systems)
  • The distribution of population in each of these Tiers varies by state (see state-level coverage map).
  • Favorable uncertainty metrics for Tier 2 and 3 estimates suggest that the TEMM approach provides a reasonable preliminary water service boundary layer.

Tier 1: Explicit boundaries

  • Data discovered for 12 states (AZ, CA, CT, KS, MO, NC, NJ, NM, OK, PA, TX, WA).
  • These data are considered “high quality” and used to model Tier 3 boundaries.

Tier 2: Matched TIGER proxy boundaries

  • 57.94% of water systems (26,026 / 44,919) match a TIGER place, either by spatial intersection between water system facility centroids and TIGER boundaries or by a water system name match to TIGER names.
  • Spatial and name matching is constrained within the reported state to prevent out-of-state mismatches.
  • When a system has multiple spatial or name matches, minimum spatial distance between the reported PWSID centroid and the matched TIGER Place centroid is used to break ties and assign a single proxy boundary.
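The tie-breaking step above can be sketched with sf. The IDs and coordinates here are synthetic, and the real pipeline computes distances across all candidate matches rather than for a single system:

```r
library(sf)

# one hypothetical PWSID centroid and two candidate TIGER place centroids
# (IDs and coordinates are invented for illustration)
pws  <- st_sfc(st_point(c(-121.50, 38.60)), crs = 4326)
cand <- st_sf(
  geoid    = c("0612345", "0667890"),
  geometry = st_sfc(st_point(c(-121.49, 38.61)),
                    st_point(c(-121.80, 38.40)), crs = 4326)
)

# distance from the PWSID centroid to each candidate place centroid
cand$dist_m <- as.numeric(st_distance(pws, cand))

# break the tie: the nearest candidate becomes the single proxy boundary
best <- cand[which.min(cand$dist_m), ]
```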

Tier 3: Modeled boundaries

  • Models are based on centroids provided by ECHO and FRS. Where geometry accuracy is low (i.e., county or state centroids), centroids of zip codes served from UCMR or mobile home park centroids are used, if available.

  • Random Forest, xgboost, and multiple linear regression models were fit to Tier 1 labeled boundaries to develop a model that estimates water system radii and spatial boundaries at all systems nationwide.

  • Linear regression models had the lowest error and best fit.

  • Circular water service boundary estimates were then calculated for all water systems; the median, 5th, and 95th percentile estimates reflect medium, low, and high water service boundary estimates.

  • The multiple linear regression fit achieves \(R^2 = 0.66\), with most of the explained variance attributable to service connection count, which intuitively makes sense: there is roughly linear scaling between service connection count and the spatial extent of a water service area. EDA, Random Forest, and xgboost models confirm the importance of service connection count as an explanatory variable.
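On synthetic data, the modeling idea reads roughly as follows. The variable names, coefficients, and noise level below are invented, and the project's actual model uses more predictors; the sketch only shows how a log-log linear fit yields low, median, and high radius estimates from prediction percentiles:

```r
# synthetic systems: radius scales roughly with sqrt(service connections),
# so a log-log linear model recovers the relationship
set.seed(1)
n         <- 500
conns     <- round(exp(runif(n, log(20), log(200000))))
radius_km <- 0.05 * sqrt(conns) * exp(rnorm(n, 0, 0.4))

fit <- lm(log(radius_km) ~ log(conns))

# low / median / high radius estimates from the 5th and 95th prediction
# percentiles (a 90% prediction interval), back-transformed to km
pred <- predict(fit, data.frame(conns = 1000),
                interval = "prediction", level = 0.90)
exp(pred)  # columns: fit (median), lwr (5th pctile), upr (95th pctile)
```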

Combine Tiers

  • After boundaries from all three Tiers are assimilated, matched, or modeled, they are combined into a final spatial data layer. As depicted in Figure 1 above, Tier 1 boundaries always supersede Tier 2 boundaries, which always supersede Tier 3 boundaries.

Limitations

  • Data and model limitations are discussed, and accompanied by recommendations in the following section.

Recommendations

  • Directions for future work to address the limitations of the TEMM approach are briefly discussed, in particular:
    • Develop an approach to refine Tier 2b polygons to disaggregate the boundary to the underlying water systems.

    • Centroid accuracy and the “pancake problem” may be improved by “re-centering” low-accuracy centroids with city/town spatial databases or acquiring new centroids from states.

    • Some water systems have a primacy agency code that conflicts with the spatial location (i.e., coordinates) of the water system. For example, a water system with a primacy agency code of state A may have a centroid located in state B. This could be resolved by additional data cleaning.

Contributions

  • The significance and broader impact of the TEMM open source, nationwide water service boundary layer is briefly discussed.
  • The reader is oriented towards contributor guides and a vision for the continual development of the project is outlined.


Key Results

The key result of this study is a nationwide water service boundary layer. Here we show the proportion of population served by each TEMM Tier at nationwide and statewide scales.


Tier coverage - nationwide

In total, the TEMM data layer represents tap water delivery to 306.88 million people served by 44,919 water systems3. This amounts to 97.22% of the population reportedly served by active community water systems in SDWIS and 90.85% of active community water systems in SDWIS.

About 121 million people are covered by Tier 1 spatial data, which is notable given that only 12 states provided explicit boundary data. However, this relatively high coverage rate is unsurprising because these states (AZ, CA, CT, KS, MO, NC, NJ, NM, OK, PA, TX, WA) include high-population states like CA, TX, and PA.

Together, around 190 million people (62.16% of the population) are covered by either a Tier 1 or a Tier 2a spatial boundary. This is an underestimate, because many systems assigned to Tier 2b are likely to be moved to Tier 2a with further refinement. However, until rules to disaggregate Tier 2b into Tier 2a and Tier 3 boundary types are developed, we evaluate the accuracy conservatively. Therefore, the remaining approximately 116 million people (37.84%) represented in the layer are covered by a less accurate Tier 2b or Tier 3 boundary. These results indicate relatively high confidence in the spatial accuracy of the resulting TEMM water boundary layer for a majority of the population served by community water systems.

pop %>% 
  kable(col.names = c("Tier", "Population count", 
                      "Population proportion (%)"), 
        align = rep("l", 3),
        caption = "Table 2. Population coverage by tier") %>% 
  kableExtra::kable_styling(full_width = FALSE)
Table 2. Population coverage by tier
Tier Population count Population proportion (%)
Tier 1 121,993,956 39.75
Tier 2a 68,758,215 22.41
Tier 2b 92,229,951 30.05
Tier 3 23,894,728 7.79

Together, 24,095 community water systems (53.64% of systems in the layer) are covered by either a Tier 1 or a Tier 2a spatial boundary. The remaining 20,824 water systems (46.36%) are covered by a Tier 2b or Tier 3 boundary.

sys %>% 
  kable(col.names = c("Tier", "CWS\ncount", 
                      "CWS\nproportion (%)"), 
        align = rep("l", 3),
        caption = "Table 3. Community water system (CWS) coverage by tier") %>% 
  kableExtra::kable_styling(full_width = FALSE)
Table 3. Community water system (CWS) coverage by tier
Tier CWS count CWS proportion (%)
Tier 1 14607 32.52
Tier 2a 9488 21.12
Tier 2b 10104 22.49
Tier 3 10720 23.87


Tier coverage - statewide

Next, we show the proportion of population covered in the final spatial layer per TEMM Tier on a state-by-state basis. Notably:

  • When Tier 1 data is present (grey bars below), it tends to cover a majority of the population.
  • Together, Tiers 2a/2b (teal and dark blue bars below) cover more population than Tier 3 across nearly all states. In some states, Tier 2a is more common than Tier 2b.
  • Tier 3 coverage (orange bars below) is not uniform across states. This implies that the Tier 2 matching algorithm may perform better in some states, and may depend on factors that vary across states, like centroid accuracy and water system type (e.g., municipal).
# dataframe for barplot geofacet: population prop served by each tier
dg <- d %>% 
  group_by(primacy_agency_code) %>%
  # population per state
  mutate(state_pop = sum(population_served_count)) %>%
  ungroup() %>% 
  group_by(primacy_agency_code, tier) %>% 
  # calculate count/proportion of population served per state and tier
  summarise(
    count = sum(population_served_count),
    prop  = count/first(state_pop)  # first(): state_pop is constant within group
  ) %>% 
  ungroup() %>% 
  distinct() %>% 
  # add missing NA values per state code and tier group
  complete(primacy_agency_code, tier,
           fill = list(count = 0, prop = 0)) 

# sanity check the grouped summary above: this should return all 1
# group_by(dg, primacy_agency_code) %>% 
#   summarise(s = sum(prop, na.rm = TRUE)) %>% 
#   pull(s) 

# geofacet of TEMM tier coverage per state in terms of population proportion
dg %>% 
  ggplot(aes(fct_rev(tier), prop, fill = tier)) + 
  geom_col() + 
  coord_flip() +
  scale_fill_carto_d(direction = -1) +
  scale_y_continuous(breaks = c(0, 1, 0.5), 
                     labels = c(0, 1, 0.5)) +
  geofacet::facet_geo(~primacy_agency_code) +
  labs(x = "", y = "Proportion of Population Covered") +
  guides(fill = "none") +
  theme_minimal(base_size = 6) +
  theme(panel.grid.minor.x = element_blank())
Fig 2. TEMM Tier distribution by state


Native American water systems are designated in their water system id (pwsid) by their respective EPA region. For a full map of EPA regions, see here. Below we show that most of these systems lack Tier 1 data and have a generally higher proportion of Tier 3 data compared to systems with state primacy codes.

dg %>% 
  # filter to Native American primacy codes
  filter(! primacy_agency_code %in% c(state.abb, "DC")) %>% 
  ggplot(aes(fct_rev(tier), prop, fill = tier)) + 
  geom_col() + 
  coord_flip() +
  scale_fill_carto_d(direction = -1) +
  scale_y_continuous(breaks = c(0, 1, 0.5), 
                     labels = c(0, 1, 0.5)) +
  facet_wrap(~primacy_agency_code) +
  labs(x = "", y = "Proportion of Population Covered") +
  guides(fill = "none") +
  theme_minimal(base_size = 6) +
  theme(panel.grid.minor.x = element_blank())
Fig 3. TEMM Tier distribution of systems in Native American territories, by EPA agency


# Data on number of native american territory systems overall vs. final set
nam_rep <- d %>% filter(! primacy_agency_code %in% c(state.abb, "DC"),
                        ! primacy_agency_code %in% c("PR", "VI", "NN", "AS", "MP", "GU")) %>% nrow()
nam <- acws %>% filter(! primacy_agency_code %in% c(state.abb, "DC"), 
                ! primacy_agency_code %in% c("PR", "VI", "NN", "AS", "MP", "GU")) %>% nrow()


In total, of the 570 systems in Native American territories nationwide, the final database represents 333 (58.42%) systems after data loss due to missing centroids. A data table of the above plots is provided below.

dg %>% 
  mutate(
    prop  = ifelse(is.na(prop), 0, prop),
    prop  = round(prop, 3)
  ) %>% 
  dt_make()

Table 4. TEMM tier distribution by state


Tier coverage - system size

Next, for each state, we quantify the proportion of systems in each Tier by water system size, based on population categories from EPA guidelines:

tribble(
~`EPA classification`, ~`Population Size`,
 "Very Small",            "500 or less",
 "Small",                 "501 - 3,300",
 "Medium",                "3,301 - 10,000",
 "Large",                 "10,001 - 100,000",
 "Very Large",            ">100,000"
) %>% 
  kable(align = rep("l", 2),
        caption = "Table 5. EPA system size categorization") %>% 
  kableExtra::kable_styling(full_width = FALSE)
Table 5. EPA system size categorization
EPA classification Population Size
Very Small 500 or less
Small 501 - 3,300
Medium 3,301 - 10,000
Large 10,001 - 100,000
Very Large >100,000

In states that lack Tier 1 boundaries, “Very Small” and “Small” systems tend to have higher proportions of Tier 3 boundaries (grey bars) compared to larger systems. This result is unsurprising because Tier 2 spatial matches hinge on connecting water systems to Census “Places”, which tend to be more populous areas. We therefore expect that smaller systems lack a Tier 2 match and are hence covered by Tier 3 boundaries. When Tier 1 data is present, it seems to cover all system sizes (e.g., CA, NJ, TX), with a slight bias towards covering larger systems (e.g., OK, WA, CT).

# water system size levels in table above
size_lvls <- c("Very Small", "Small", "Medium", "Large", "Very Large")

# dataframe for barplot geofacet: prop of each tier per size class
dg2 <- d %>% 
  mutate(
    size_class = case_when(
      population_served_count <= 500 ~ "Very Small",
      population_served_count > 500 & population_served_count <= 3300 ~ "Small",
      population_served_count > 3300 & population_served_count <= 10000 ~ "Medium",
      population_served_count > 10000 & population_served_count <= 100000 ~ "Large",
      population_served_count > 100000 ~ "Very Large"),
    size_class = factor(size_class, levels = size_lvls)
  ) %>% 
  group_by(primacy_agency_code, size_class) %>%
  # total count of tiers per state and size class
  mutate(size_class_count = n()) %>%
  ungroup() %>% 
  group_by(primacy_agency_code, size_class, tier) %>% 
  # calculate count/proportion of tiers per state and size class
  summarise(
    count = n(),
    prop  = count/first(size_class_count)  # first(): constant within group
  ) %>% 
  ungroup() %>% 
  distinct() %>% 
  # add missing NA values per state code and tier group
  complete(primacy_agency_code, size_class, tier, 
           fill = list(count = 0, prop = 0)) 

# geofacet of TEMM tier coverage per state in terms of population proportion
dg2 %>% 
  ggplot(aes(size_class, prop, fill = tier)) + 
  geom_col(position = "stack") + 
  coord_flip() +
  scale_fill_carto_d(direction = -1) +
  scale_y_continuous(breaks = c(0, 1, 0.5), 
                     labels = c(0, 1, 0.5)) +
  geofacet::facet_geo(~primacy_agency_code) +
  labs(x = "", 
       y = "Proportion of Systems Covered by Each Tier, Split by System Size",
       fill = "Tier") +
  theme_minimal(base_size = 6) +
  theme(panel.grid.minor.x = element_blank(), legend.position = c(0.95, 0.15))
Fig 4. TEMM Tier distribution by system size


Repeating the above analysis for primacy codes indicating Native American territories, we see that Native American systems are generally small or very small according to EPA size classification.

dg2 %>% 
  filter(! primacy_agency_code %in% c(state.abb, "DC")) %>% 
  ggplot(aes(size_class, prop, fill = tier)) + 
  geom_col(position = "stack") + 
  coord_flip() +
  scale_fill_carto_d() +
  scale_y_continuous(breaks = c(0, 1, 0.5), 
                     labels = c(0, 1, 0.5)) +
  facet_wrap(~primacy_agency_code) +
  labs(x = "", 
       y = "Proportion of Systems Covered by Each Tier, Split by System Size (Population)",
       fill = "Tier") +
  theme_minimal(base_size = 6) +
  theme(panel.grid.minor.x = element_blank())
Fig 5. TEMM Tier distribution of systems in Native American territories, by system size



A data table of Tier categories across system sizes, nationwide, is provided below. The proportion column represents each Tier's share within the size class.

# dataframe for nationwide data table: prop of each tier per size class
d %>% 
  mutate(
    size_class = case_when(
      population_served_count <= 500 ~ "Very Small",
      population_served_count > 500 & population_served_count <= 3300 ~ "Small",
      population_served_count > 3300 & population_served_count <= 10000 ~ "Medium",
      population_served_count > 10000 & population_served_count <= 100000 ~ "Large",
      population_served_count > 100000 ~ "Very Large"),
    size_class = factor(size_class, levels = size_lvls)
  ) %>% 
  group_by(size_class) %>%
  # total count of tiers per state and size class
  mutate(size_class_count = n()) %>%
  ungroup() %>%
  group_by(size_class, tier) %>%
  # calculate count/proportion of tiers per state and size class
  summarise(
    count = n(),
    prop  = round(count/first(size_class_count), 2)  # first(): constant within group
  ) %>% 
  ungroup() %>% 
  distinct() %>% 
  # add missing NA values per state code and tier group
  complete(size_class, tier, 
           fill = list(count = 0, prop = 0)) %>%
  dt_make()

Table 6. TEMM Tier distribution by system size

A data table of the above plots, looking at Tier counts by primacy agency code, is provided below.

dg2 %>% 
  mutate(
    prop  = ifelse(is.na(prop), 0, prop),
    prop  = round(prop, 3)
  ) %>% 
  dt_make()

Table 7. TEMM Tier distribution by system size and primacy agency code


National boundary layer

The national water service boundary layer is too large (around 400 MB) to plot in this interactive report; thus, we show a static map below to illustrate the coverage provided by Tiers 1-3 in the proportions described above, and direct the reader to the TEMM spatial layer in the project repository.

include_graphics(here("docs/img/temm-nation.png"))
Fig 6. Nationwide TEMM water system boundary layer


Here, we zoom into California, Nevada, and Oregon – three states with high proportions of Tier 1, 2, and 3 spatial boundaries respectively. The Tier 3 circular buffers in this interactive map represent median (50th percentile) model estimates. Click on the layers icon to toggle between Tiers.

# plot CA, NV, OR
temm %>% 
  filter(primacy_agency_code %in% c("CA", "NV", "OR")) %>% 
  select(pwsid, service_connections_count, tier) %>% 
  mapview::mapview(zcol = "tier", burst = TRUE)

Next, we review the construction of each Tier.


Tier 1: Explicit boundaries

Tier 1 boundaries are the most straightforward to understand: states provide explicit data, and when they do, these boundaries tend to cover nearly the entire population served by water systems. Missing Tier 1 data limits the accuracy of downstream Tier 2 and Tier 3 boundary estimates. At the time of writing, Tier 1 data is present for 12 states (AZ, CA, CT, KS, MO, NC, NJ, NM, OK, PA, TX, WA).

# summarize missing and present states in a dataframe
states_missing <- state.abb[!state.abb %in% unique(wsb_labeled_clean$state)]
states_present <- unique(wsb_labeled_clean$state)

states_df <- tibble(
  state_abbr = c(states_missing, states_present),
  status = c(rep("missing", length(states_missing)), 
             rep("present", length(states_present)))
)

# pull usa state geometries, project to input data CRS
usa <- USAboundaries::us_states(resolution = "low") %>% 
  st_transform(epsg) %>% 
  left_join(states_df) %>% 
  tigris::shift_geometry() %>% 
  # remove Puerto Rico
  filter(!is.na(status))

# map of missing states
ggplot() +
  geom_sf(data = usa, aes(fill = status), alpha = 0.5, color = "white") +
  scale_fill_carto_d(palette = "Pastel") +
  labs(fill = "") +
  theme_void() + 
  labs(title = "Labeled data coverage by state",
       subtitle = paste("Last updated:", Sys.Date()))
Fig 7. Presence/absence of explicit boundaries in US


We assume that data provided by states is correct, and do not clean Tier 1 data beyond assimilating the data across states into a single layer.
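As a minimal sketch of that assimilation step, two tiny synthetic "state" layers below stand in for real state files, and the column names are illustrative; the point is simply that layers with a common schema and CRS can be stacked into one:

```r
library(sf)

# build a one-row stand-in for a state's labeled boundary layer
# (IDs and coordinates are invented for illustration)
make_state <- function(pwsid, state, x, y) {
  st_sf(pwsid = pwsid, state = state,
        geometry = st_sfc(st_buffer(st_point(c(x, y)), 0.01), crs = 4326))
}

states <- list(make_state("XX0000001", "XX", -121.5, 38.6),
               make_state("YY0000001", "YY",  -97.7, 30.3))

# assimilate: identical columns and CRS, stacked into a single layer
wsb_labeled <- do.call(rbind, states)
```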


Tier 2: Matched boundaries

A detailed overview of the Tier 2 matching approach is provided in the project Github repository at src/analysis/sandbox/matching/methodology.md, and a brief overview is provided here.

A master list of active, community water systems is derived from SDWIS data. Supporting locational information, such as city served, is joined from supporting SDWIS tables. Federal water facility data (i.e., ECHO, FRS) exist with spatial centroids (latitude/longitude of the water system facility), but no federal sources provide explicit spatial boundaries. Although ECHO is a superset of FRS centroids, both data sources are used because together they improve matching. Many centroids in the ECHO database are poor quality because they are merely the centroid of the county or state of the water system. In these cases, where available, zip code centroids are substituted, based on the zip codes served data provided for select water systems in the UCMR dataset or, for mobile home parks, on MHP centroids.

We match water system name, city served name, and spatial attributes of these data (centroids) to US Census Place spatial polygons (TIGER shapefiles), which are assumed to represent reasonable proxy water system boundaries. Matched TIGER polygon boundaries thus represent what we call “Tier 2” boundaries. Multiple matches are possible among the different match strategies, and thus a set of steps outlined in the methodology is taken to assign the best match. Best matches are based on rules that are validated against systems with existing labeled boundaries. Many water systems match TIGER places one-to-one (Tier 2a). Still, multiple water systems can match to the same TIGER place; where they do, the proxy boundaries for those systems perfectly overlap (Tier 2b). This makes sense, as many urban areas contain multiple water systems. For the MVP model, we allow for the overlap of water systems within the same TIGER polygon. A future step should include rules to assign all Tier 2b systems to either Tier 2a or Tier 3.
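The two match signals can be illustrated with a toy example. The geometries, IDs, and names below are synthetic, and the name comparison is a deliberately crude containment check; the real pipeline's name cleaning and matching rules are more careful:

```r
library(sf)

# synthetic TIGER place polygon and water system facility centroid
tiger <- st_sf(
  geoid = "0612345", name = "RIVERTON",
  geometry = st_sfc(st_polygon(list(rbind(
    c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)
  ))), crs = 4326)
)
facility <- st_sf(
  pwsid = "XX0000001", pws_name = "CITY OF RIVERTON",
  geometry = st_sfc(st_point(c(0.5, 0.5)), crs = 4326)
)

# spatial signal: does the facility centroid fall inside the place polygon?
spatial_match <- st_join(facility, tiger, join = st_intersects)

# name signal: crude containment check after upper-casing
name_match <- grepl(tiger$name, toupper(facility$pws_name), fixed = TRUE)
```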

The uncertainty of using TIGER polygons as proxies for water system boundaries is evaluated in the next section.


Uncertainty

We estimate the uncertainty in the Tier 2 matching approach with a Modified Jaccard Similarity, defined as the proportion of area overlap between the TIGER match and the underlying water service boundary. We calculate the Modified Jaccard Index \(J\) by comparing Tier 2 matched TIGER geometries to explicit Tier 1 geometries.

\[ J = \frac{A_T \cap A_E }{A_E} \] where \(A_T \cap A_E\) is the intersection of the TIGER Place area (\(A_T\)) and the explicit Tier 1 water service boundary area (\(A_E\)). Thus, the quantity \(J\) lies in the closed interval \([0, 1]\): \(\{J \space \space | \space \space 0 \le J \le 1\}\). Importantly, \(A_T\) can fall outside of \(A_E\). The Jaccard similarity \(J\) measures the proportion of coverage that \(A_T\) provides over \(A_E\).

# Tier 2 TIGER geometries for explicit boundaries
tiger <- path(staging_path, "tigris_places_clean.geojson") %>% 
  st_read(quiet = TRUE) %>% 
  select(tiger_match_geoid = geoid)

t2 <- temm %>% 
  filter(tier == "Tier 1" & !is.na(tiger_match_geoid)) %>% 
  st_drop_geometry() %>% 
  select(tiger_match_geoid, pwsid) %>% 
  left_join(tiger) %>% 
  st_as_sf() %>% 
  arrange(pwsid) %>% 
  # put into metric CRS, ensure valid geom for intersection
  st_transform(3310) %>% 
  st_make_valid() 

# Tier 1: explicit boundaries
t1 <- temm %>% 
  filter(
    tier == "Tier 1" & 
    pwsid %in% t2$pwsid
  ) %>% 
  select(pwsid) %>% 
  arrange(pwsid) %>% 
  # put into metric CRS, ensure valid geom for intersection, calc area
  st_transform(3310) %>% 
  st_make_valid() %>% 
  mutate(area_t1 = st_area(geometry)) 

# sanity check: t1 and t2 vectors are identical and ready to be compared
# identical(t1$pwsid, t2$pwsid)

# calculate intersection of t2 over t1 and calculate area
f_intersection <- function(x){
  res = st_intersection(t2[x, ], t1[x, ])
  if(nrow(res) == 0){
    res = 0
  } else {
    res = res %>% 
      mutate(area = st_area(geometry)) %>% 
      pull(area)
  }
  return(res)
}

# perform the intersection - needs to be pairwise, hence the func above
#t1$area_intersection <- map_dbl(1:nrow(t1), ~f_intersection(.x)) # slow

# calculate modified jacard
#t1$jacard <- as.numeric(t1$area_intersection / t1$area_t1)

# load pre-computed result from above b/c the previous 2 lines take time
#write_rds(t1, path(staging_path, "t1_jacard.rds"))
t1 <- read_rds(path(staging_path, "t1_jacard.rds"))

# sanity check that intersection worked
# j = 6
# mapview(t1[j,]) + 
#   mapview(t2[j,], col.regions = "green") + 
#   mapview(st_intersection(t2[j, ], t1[j, ]), col.regions = "red")
# t1$jacard[j]

We calculate that 83.8% of Tier 2 matched TIGER Place boundaries spatially intersect their assigned explicit Tier 1 boundary (when present). The proportion of overlap (Modified Jaccard) for the 83.8% of intersecting Tier 1 and 2 geometries is generally left-skewed, which indicates high overlap. A Jaccard of 1 indicates complete overlap (i.e., the TIGER Place sufficiently covers the explicit water system boundary), a Jaccard of 0.1 indicates that only 10% of the Tier 1 explicit boundary is covered by the Tier 2 boundary, and a Jaccard of 0 indicates zero overlap (i.e., non-intersection).

t1 %>% 
  filter(jacard != 0) %>% 
  ggplot(aes(jacard)) +
  geom_histogram(color = "white") +
  labs(x = "Modified Jaccard Similarity") +
  theme_report
Fig 8. Histogram of Modified Jaccard similarity index


16.07% of matched Tier 1 and Tier 2 boundaries do not intersect; however, we show that the vast majority of these geometries are very close to one another, which is unsurprising, as matches are constrained within state. The distribution of distances between centroids of non-intersecting matches is shown below.

# a portion (~16%) of matched tier 2 geoms don't intersect tier 1 explicit
# boundaries, but are they close by? calculate centroid distances
t1_zero <- t1 %>% 
  filter(jacard == 0) %>% 
  mutate(geometry = st_centroid(geometry)) %>% 
  arrange(pwsid)

t2_zero <- t2 %>% 
  filter(pwsid %in% t1_zero$pwsid) %>% 
  mutate(geometry = st_centroid(geometry)) %>% 
  arrange(pwsid)

# sanity check: t1 and t2 vectors are identical and ready to be compared
# identical(t1_zero$pwsid, t2_zero$pwsid)

# calculate pairwise distances between Tier 1-2 non-intersecting centroids
# with "zero jacard" (zd)
# output distance is in kilometers: the projected CRS is in meters and we divide by 1000
f_distance <- function(x){
  res = st_distance(t1_zero[x, ], t2_zero[x, ])
  return(as.numeric(res)/1000)
}
#zjd <- map_dbl(1:nrow(t1_zero), ~f_distance(.x)) # slow

# load pre-computed result from above because it takes a while to run
#write_rds(zjd, path(staging_path, "zero_dist_non_intersects.rds"))
zjd <- read_rds(path(staging_path, "zero_dist_non_intersects.rds"))

# histogram of distances in miles 
# 1 km = 0.621371 miles
km_to_mi = 0.621371

tibble(dist = zjd) %>% 
  ggplot(aes(dist*km_to_mi)) +
  geom_histogram(color = "white", bins = 200) +
  labs(x = "Distance (mi)") +
  coord_cartesian(xlim = c(0, 500)) + 
  theme_report 
Fig 9. Distance between matched Tier 1 and Tier 2 polygon centroids

The median distance between matched Tier 1 and Tier 2 geometries is 9.15 mi, and the interquartile range is 4.7 - 25.48 mi. This suggests that non-spatially intersecting matches are still quite close together and hence may be considered adequate preliminary water service boundaries. However, if spatial accuracy closer than several miles is paramount, it may be determined that Tier 3 boundaries should be used instead of Tier 2 boundaries. In the absence of a Tier 1 boundary, where Tier 2 boundaries do not spatially intersect the provided ECHO centroid, we may decide at a later date to prefer the Tier 3 boundary, for instance, if the distance between the ECHO centroid and the matched TIGER Place exceeds some reasonable threshold.
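Such a preference rule could be as simple as the sketch below; the 25 mi cutoff is a placeholder assumption, not a recommended threshold:

```r
# prefer the modeled Tier 3 buffer over a matched Tier 2 place when the
# matched TIGER place lies beyond a distance threshold from the ECHO centroid
choose_tier <- function(dist_mi, max_dist_mi = 25) {
  ifelse(dist_mi > max_dist_mi, "Tier 3", "Tier 2")
}

choose_tier(c(3, 9.15, 40))  # "Tier 2" "Tier 2" "Tier 3"
```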

Results suggest that matching water system spatial boundaries and attributes (i.e., name) to TIGER Places is a valid approach to construct “proxy” water system boundaries. These findings are relatively consistent with findings by Patterson et al. (2022), who showed that Census TIGER Places generally agree with municipal boundaries and that these municipal boundaries are valid proxies for water service areas.


Tier 3: Modeled boundaries

In the absence of explicit spatial boundaries (Tier 1) and matched TIGER proxy boundaries (Tier 2), we statistically model an estimated boundary (Tier 3). Model specification hinges on the correlation between the radius of Tier 1 convex hulls4 (the response variable), and predictors that explain this response, such as service connection count, population served, ownership type and so on.

We experimented with three models: random forest, xgboost, and multiple linear regression. These statistical and machine learning models differ in their assumptions and performance, as discussed in the sections that follow.

Ultimately, we selected the multiple linear regression model as our final model because it is computationally efficient, easily interpretable, provides confidence intervals to characterize uncertainty in the modeled boundary (this may be useful depending on the application of the model results), and finally because it avoids overfitting5.


EDA

In this section of the report, we expand on important data pre-processing steps, feature selection/engineering, and developed models.


Right skewed response

The response variable (radius) from the 12 labeled Tier 1 states (AZ, CA, CT, KS, MO, NC, NJ, NM, OK, PA, TX, WA) is right-skewed. The median radius is 0.54 mi, and the maximum radius is 33.05 mi. Linear models assume constant variance and normality, so we log-transform the response to stabilize its variance, enable regression and inference approaches, prevent high-leverage radii from exerting undue influence on the model fit, and disallow negative predicted radii.

d %>% 
  ggplot(aes(radius*km_to_mi)) + 
  geom_histogram(bins = 100, col = "white") +
  labs(x = "Radius (mi)") +
  theme_report
Fig 10. Distribution of water system radius in Tier 1 systems

The main drawback of log-transformation is that model coefficient units (e.g., the slope of the regression fit) and measures of performance (e.g., RMSE) become more difficult to interpret; however, there are methods to handle this interpretation6. The benefits of log-transformation outweigh the drawbacks, so we log-transform the response variable.
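For example, an error metric computed on the log10 scale can be exponentiated into a multiplicative factor to recover an interpretable quantity (illustrative arithmetic only):

```r
# an RMSE of 0.34 on log10(radius) means predictions are typically within
# a multiplicative factor of 10^0.34 of the observed radius
rmse_log10 <- 0.34
10^rmse_log10  # ~2.19, i.e., off by roughly a factor of two either way
```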

Moreover, to avoid over-fitting a model on a training set that over-represents the more common low-radius observations, we perform a stratified random sample on the response variable. This samples equal proportions of training observations from each quartile of the radius distribution, delineated by red dashed lines in the figure below.
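The stratified sample can be sketched in plain dplyr on simulated radii (the modeling code later in this report uses rsample::initial_split() with strata = radius, which bins a numeric stratification variable into quartiles by default):

```r
library(dplyr)

set.seed(55)
d_toy <- data.frame(radius = rlnorm(1000))  # simulated right-skewed radii

train <- d_toy %>%
  mutate(stratum = ntile(radius, 4)) %>%  # quartile bins
  group_by(stratum) %>%
  slice_sample(prop = 0.8) %>%            # 80% sample within each quartile
  ungroup()

nrow(train)  # 800
```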

d %>% 
  # calculate quantiles to show stratified random sampling
  mutate(
    p25 = quantile(radius, 0.25, na.rm = TRUE)*km_to_mi,
    p50 = quantile(radius, 0.50, na.rm = TRUE)*km_to_mi,
    p75 = quantile(radius, 0.75, na.rm = TRUE)*km_to_mi
  ) %>% 
  ggplot(aes(radius*km_to_mi)) + 
  geom_histogram(bins = 100, col= "white") + 
  geom_vline(aes(xintercept = p25), linetype = "dashed", color = "red") +
  geom_vline(aes(xintercept = p50), linetype = "dashed", color = "red") +
  geom_vline(aes(xintercept = p75), linetype = "dashed", color = "red") +
  scale_x_log10() +
  labs(x = "Radius (mi)") +
  theme_report
Fig 11. Distribution of system radii with quartiles


Correlated predictors

Population served and service connections are highly correlated (\(r = 0.84\)), and this correlation holds across categories such as owner type code. Both also exhibit log-normality; thus, like the response variable, they are log-transformed, which further impacts the interpretation of the regression fit7. Moreover, these two correlated predictors should not be used together in a linear model (i.e., to avoid multicollinearity, which reduces the precision of estimated model coefficients), so we use only one, and try a model that multiplies both predictors to see if combining them improves the fit (albeit at the loss of some interpretability). We can, however, use both variables in tree-based methods.
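The strength of this correlation can be checked quickly; the snippet below uses simulated stand-ins for the two columns (toy data, not the report's dataset):

```r
set.seed(1)
pop  <- rlnorm(500, meanlog = 7, sdlog = 1.5)        # population served
conn <- pop / rlnorm(500, meanlog = 1, sdlog = 0.2)  # ~persons per connection

# strong correlation on the log scale, as in Fig 12
cor(log10(pop), log10(conn))
```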

d %>% 
  ggplot(aes(population_served_count, service_connections_count)) + 
  geom_point(aes(color = owner_type_code), alpha = 0.2) +
  rcartocolor::scale_color_carto_d() +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Population served", 
       y = "Service connections",
       color = "Owner Type Code") +
  theme_report
Fig 12. Service connections vs. population served

The same relationship, faceted by owner type code, shows that this correlation persists across ownership types. Importantly, although we lack Tier 1 data for Native American systems (owner type code = “N”), the population and service connection scaling is present across ownership types, so models trained on other Tier 1 ownership types are likely to generalize well to these systems.

d %>% 
  ggplot(aes(population_served_count, service_connections_count)) + 
  geom_point(aes(color = owner_type_code), alpha = 0.2) +
  rcartocolor::scale_color_carto_d() +
  scale_x_log10() +
  scale_y_log10() +
  facet_wrap(~owner_type_code) + 
  labs(x = "Population served", 
       y = "Service connections",
       color = "Owner Type Code") +
  guides(color = "none") +
  theme_report
Fig 13. Service connections vs. population served, faceted


Interaction effects

Linear regression uses correlations between features of a dataset (called predictors) to predict an outcome (i.e., the response variable). Standard regression assumes that each predictor's effect on the response is additive and does not depend on the other predictors. When the effect of one predictor does depend on the value of another (an interaction), this should be specified in the model; omitting it yields less reliable results with higher standard errors.

Interactions between variables can be evaluated by plotting model predictors against one another and looking for differences in regression slopes.
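In R's formula interface an interaction is specified with `*` (or `:`); the sketch below fits one on simulated data, where the grouping factor stands in for owner type code:

```r
# slope of y on x differs by group, so an interaction term is needed
set.seed(1)
grp <- factor(sample(c("L", "P"), 300, replace = TRUE))
x   <- rnorm(300)
y   <- ifelse(grp == "L", 0.5, 1.2) * x + rnorm(300, sd = 0.3)

fit <- lm(y ~ x * grp)          # expands to x + grp + x:grpP
coef(summary(fit))["x:grpP", ]  # group-specific slope difference (~0.7 here)
```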


Owner type code

Regression slopes for service connections differ by owner type code. This suggests an interaction term between service connections (and population, by correlation) and owner type. Most systems are owned by local governments (L) or private (P) entities. Moreover, Native American (N) systems are poorly represented in Tier 1 data; this is discussed in the Limitations and Recommendations sections and should be addressed in future work8.

d %>% 
  # remove 2 "N" codes that are not meaningful
  filter(owner_type_code != "N") %>% 
  ggplot(aes(service_connections_count, radius*km_to_mi)) + 
  geom_point(alpha = 0.2) + 
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~owner_type_code) + 
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Service Connections", 
       y = "Radius (mi)") +
  theme_report
Fig 13. Interaction plot: owner type code vs. service connections


Wholesaler

As before with owner type code, wholesaler status interacts with service connections, thus we include an interaction effect for it in the linear model.

d %>% 
  ggplot(aes(service_connections_count, radius*km_to_mi)) + 
  geom_point(alpha = 0.2) + 
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~is_wholesaler_ind) + 
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Service Connections", 
       y = "Radius (mi)") +
  theme_report
Fig 14. Interaction plot: wholesaler vs. service connections


Service area type code

Service area type codes enumerate all possible service areas of a water system (e.g., a daycare, a mobile home park, a residential area). Because there are many permutations and possibilities, we select the top 4 service area type codes as features for the model, lump everything else into an “Other” category, and train on this feature. This reduces class imbalance across the categories. We include an interaction effect for service area type code.

# clean service area type code column
d <- d %>% 
  mutate(
    # split type codes in the "python list" into chr vectors
    satc = strsplit(service_area_type_code, ", "),
    # map over the list to remove brackets ([]) and quotes (')
    satc = map(satc, ~str_remove_all(.x, "\\[|\\]|'")),
    # sort the resulting chr vector
    satc = map(satc, ~sort(.x)), 
    # collapse the sorted chr vector
    satc = map_chr(satc, ~paste(.x, collapse = ""))
  ) 

After cleaning, there are 706 unique service area type code combinations, and most are “RA” or “MH”, indicating that most water systems serve residential areas and areas that include mobile home parks. Regression slopes appear similar but with different y-intercepts, which suggests that some service area types are larger for an equal service connection count. We may consider cleaning this further in a future iteration if domain knowledge indicates predictive power, although preliminary visual inspection (below) suggests little explanatory power.

# plot service connections v radius and facet by top 3 service area type code
d %>% 
  mutate(
    satc = fct_lump_n(satc, 3), 
    satc = as.character(satc), 
    satc = ifelse(is.na(satc), "Other", satc)
  ) %>%
  ggplot(aes(service_connections_count, radius*km_to_mi)) + 
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE) + 
  facet_wrap(~satc) +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Service connections", y = "Radius (mi)") +
  theme_report
Fig 14. Interaction plot: service area type code vs. service connections


Models

We fit random forest, xgboost, and multiple linear regression models to a training set of 80% of the data selected by stratified random sampling to avoid overfitting to low-radius observations. We then evaluate model performance on the 20% holdout test set, discuss model uncertainty and limitations, interpret model results, and use the model to generate Tier 3 predictions on unlabeled data not covered by Tier 1 or 2 boundaries.


Random Forest

We first fit a random forest to the Tier 1 labeled data because the algorithm makes less strict assumptions than regression, can handle correlated predictors, and produces a variable importance output that quickly indicates which variables explain the most variance in the dataset. This can confirm our understanding of the relationships explored in the EDA above and inform how to specify subsequent, more interpretable linear models.

We fit the random forest regression to predict radius from service connections, population, owner type, service area type code, and wholesaler status. Service connections, population served, and owner type code appear important in segmenting the feature space, while service area type code and wholesaler status appear less predictive.

library(tidymodels)
library(vip)

# read full dataset 
df <- read_csv(path(staging_path, "matched_output_clean.csv")) 

# unlabeled data (du) and labeled data (dl)
du <- df %>% filter(is.na(radius))
dl <- df %>% filter(!is.na(radius))

# split labeled data (dl) into train and test with stratified random sampling
# in each of the radius quartiles to account for the lognormal distribution
# of the response variable (radius) and avoid overfitting to small radius obs
set.seed(55)
dl_split <- initial_split(dl, prop = 0.8, strata = radius)
train    <- training(dl_split) 
test     <- testing(dl_split)

# model and workflow
rf_mod <- 
  rand_forest(trees = 1000) %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("regression")

rf_wflow <- 
  workflow() %>% 
  add_formula(
    radius ~ 
      population_served_count + 
      # importantly, the RF can have correlated predictors, so we add
      # service connections, and don't need to account for interactions
      service_connections_count + 
      # use the cleaned owner type code from preprocess.R, which converts 
      # 2 "N" owner type codes to "M" so that models can evaluate
      owner_type_code_clean +  
      is_wholesaler_ind + 
      satc
  ) %>% 
  add_model(rf_mod) 

# fit the random forest model
rf_fit <- fit(rf_wflow, train)

# show variable importance
rf_fit %>%
  extract_fit_parsnip() %>%
  vip(geom = "point") +
  theme_report
Fig 15. Random forest variable importance

Next we evaluate the model on the test set, plot residuals, and calculate an \(R^2\) goodness-of-fit metric and two standard error metrics, the RMSE and MAE. For our purposes, the MAE is likely more important because it doesn't penalize outliers as much as RMSE, and we care more about the overall appropriateness of the boundary than we do about preventing large outlying errors.
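The difference between the two metrics is easy to see with a toy error vector containing one outlier:

```r
# a single large miss inflates RMSE far more than MAE
err <- c(rep(0.1, 9), 3)   # nine small errors, one outlier
sqrt(mean(err^2))          # RMSE ~ 0.95
mean(abs(err))             # MAE  ~ 0.39
```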

# predict on test set
rf_test_res <- test %>% 
  # select(radius) %>% 
  bind_cols(predict(rf_fit, test)) 

# plot residuals
rf_test_res %>% 
  ggplot(aes(log10(radius*km_to_mi), log10(.pred*km_to_mi), color = owner_type_code)) + 
  geom_point(alpha = 0.4) + 
  geom_abline(lty = 2, color = "red") + 
  labs(y = "Predicted radius (log10)", x = "Radius (log10)",
       color = "Owner Type") +
  # scale and size the x- and y-axis uniformly
  coord_obs_pred() +
  theme_report
Fig 16. Random forest test residuals, highlighted by ownership type

# calculate goodness of fit and error metrics
rf_metrics <- metric_set(rsq, rmse, mae)
rf_metrics(rf_test_res, truth = log10(radius), estimate = log10(.pred)) %>% 
  select(-.estimator) %>% 
  knitr::kable(col.names = c("Metric", "Estimate")) %>% 
  kableExtra::kable_styling(full_width = FALSE)
Metric Estimate
rsq 0.6577181
rmse 0.3804640
mae 0.3077117


xgboost

Like random forest, xgboost is a tree-based method that is relatively easy to fit and unconstrained by linear model assumptions. Variable importance metrics were consistent with those examined in the random forest model. We tuned xgboost hyperparameters across a grid of inputs (see the plot below, used to select the best xgboost model) and found that, on the holdout test set, it performed slightly worse than the random forest in terms of goodness of fit and error. Though surprising at first, this result makes sense given the more flexible nature of the xgboost model, the lack of the larger datasets that flexible models benefit from, and the underlying functional form of the relationship we aim to model.

# load hyperparameter tune space
xgb_res <- read_rds(here("src/analysis/sandbox/model_explore/etc/xgb_res.rds"))

# visualize model performance across tuning grid
xgb_res %>%
  collect_metrics() %>%
  filter(.metric == "rsq") %>%
  select(mean, min_n:learn_rate) %>%
  pivot_longer(min_n:learn_rate,
               values_to = "value",
               names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "rsq") +
  theme_report
Fig 17. XGBoost model performance across tuning grid

xgboost is a flexible machine learning model: it tends to outperform random forest when there is sufficient data to segment the feature space and learn from residuals. In our case, however, the data are modest (only a few tens of thousands of water systems), with only 3 high-importance predictors, 2 of which are highly correlated. Thus, the similar performance of xgboost and random forest may be explained by xgboost being unable to improve on random forest while slightly overfitting to the training set, which a less flexible model like linear regression does not suffer from. In statistical terms, low-flexibility models also tend to maintain low variance, meaning they perform similarly across new test sets. Finally, because the underlying functional form of the relationship being modeled is strongly linear (see the section on Interaction Effects), there is no great advantage in straying from linear methods.

# load the best fit model
final_xgb <- read_rds(here("src/analysis/sandbox/model_explore/etc/final_xgb.rds"))

# fit the final xgboost model on training data
xgb_fit <- fit(final_xgb, train)

# show variable importance
xgb_fit %>%
  extract_fit_parsnip() %>%
  vip(geom = "point") +
  theme_report
Fig 18. XGBoost variable importance

# predict on test set
xgb_test_res <- test %>% 
  select(radius) %>% 
  bind_cols(predict(xgb_fit, test)) 

# plot residuals
xgb_test_res %>% 
  ggplot(aes(log10(radius*km_to_mi), log10(.pred*km_to_mi))) + 
  geom_point(alpha = 0.4) + 
  geom_abline(lty = 2, color = "red") + 
  labs(y = "Predicted radius (log10)", x = "Radius (log10)") +
  # scale and size the x- and y-axis uniformly
  coord_obs_pred() +
  theme_report
Fig 19. XGBoost test residuals

# calculate goodness of fit and error metrics
xgb_metrics <- metric_set(rsq, rmse, mae)
xgb_metrics(xgb_test_res, truth = log10(radius), estimate = log10(.pred)) %>% 
  select(-.estimator) %>% 
  knitr::kable(col.names = c("Metric", "Estimate")) %>% 
  kableExtra::kable_styling(full_width = FALSE)
Metric Estimate
rsq 0.6540102
rmse 0.3833233
mae 0.3112511


Linear Regression

As explained in the review of random forest and xgboost models above, there is sound rationale for linear regression in the context of this problem. A strong (and intuitive) linear relationship is observed between Tier 1 water system radii and service connections. The linear model fit outperforms other models, is easily interpretable, and provides standard error metrics. The underlying data do not have a high number of observations that machine learning models would benefit from, and the parameter space is also not high in dimensionality. For these reasons, we select the linear model to estimate Tier 3 boundaries.

Unlike the random forest and xgboost models, we are careful to log-transform the log-normal predictor and response, and to specify interaction terms where present. We experimented with different model specifications, including a model that combined the correlated population and service connection variables; however, this led to negligible improvement in the model fit and less interpretable model coefficients, so the final specification regresses only on service connections. Interaction terms are added for owner type, service area type code, and wholesaler status. A simple linear regression on service connections alone has an \(R^2\) of 0.56; including these extra terms substantially improves the model fit (\(R^2 = 0.66\)) and reduces test error.

# re-read dataset and log transform the response - only for linear model
df <- read_csv(path(staging_path, "matched_output_clean.csv")) %>% 
  mutate(radius  = log10(radius),
         # multiply correlated predictors - however, this has negligible 
         # impact on model fit and makes it harder to interpret, so ignore
         density = population_served_count * service_connections_count)

# unlabeled data (du) and labeled data (dl)
du <- df %>% filter(is.na(radius))
dl <- df %>% filter(!is.na(radius))

# split labeled data (dl) into train and test with stratified random sampling
# in each of the radius quartiles to account for the lognormal distribution
# of the response variable (radius) and avoid overfitting to small radius obs
set.seed(55)
dl_split <- initial_split(dl, prop = 0.8, strata = radius)
train    <- training(dl_split) 
test     <- testing(dl_split)

# lm recipe
lm_recipe <- 
  # specify the model - interaction terms come later
  recipe(
    radius ~ 
      service_connections_count + 
      # use the cleaned owner type code from preprocess.R, which converts 
      # 2 "N" owner type codes to "M" so that models can evaluate
      owner_type_code_clean + 
      satc +
      is_wholesaler_ind,
    data = train
  ) %>% 
  # convert predictors to log10
  step_log(service_connections_count, base = 10) %>% 
  # encode categorical variables  
  step_dummy(all_nominal_predictors()) %>% 
  # specify interaction effects
  step_interact(~service_connections_count:starts_with("owner_type_code")) %>%
  step_interact(~service_connections_count:starts_with("satc")) %>%
  step_interact(~service_connections_count:starts_with("is_wholesaler_ind"))

# specify model and engine for linear model and rf
lm_mod <- linear_reg() %>% set_engine("lm")

# lm workflow
lm_wflow <- 
  workflow() %>% 
  add_model(lm_mod) %>% 
  add_recipe(lm_recipe)

# fit the linear model on the training set
lm_fit <- fit(lm_wflow, train)

# predict on the test set and bind mean predictions and CIs
lm_test_res <- test %>% 
  select(radius) %>% 
  bind_cols(predict(lm_fit, test)) %>% 
  bind_cols(predict(lm_fit, test, type = "conf_int"))

# plot residuals
lm_test_res %>% 
  # radius and .pred are already log10(km); shift by log10(km_to_mi) for miles
  ggplot(aes(radius + log10(km_to_mi), .pred + log10(km_to_mi))) + 
  geom_point(alpha = 0.4) + 
  geom_abline(lty = 2, color = "red") + 
  labs(y = "Predicted radius (log10 mi)", x = "Radius (log10 mi)") +
  # scale and size the x- and y-axis uniformly
  coord_obs_pred() +
  theme_report
Fig 20. Linear model residuals

On the test set, linear model goodness of fit is comparable to random forest, but error metrics are notably lower.

# calculate goodness of fit and error metrics
lm_metrics <- metric_set(rsq, rmse, mae)
lm_metrics(lm_test_res, truth = radius, estimate = .pred) %>% 
  select(-.estimator) %>% 
  knitr::kable(col.names = c("Metric", "Estimate")) %>% 
  kableExtra::kable_styling(full_width = FALSE)
Metric Estimate
rsq 0.6622484
rmse 0.3429215
mae 0.2656831


Spatial residuals

We examine model residuals across space (i.e., Tier 3 predictions minus Tier 1 explicit labels) to test for spatial autocorrelation in the model residuals. Although we do not calculate standard autocorrelation metrics, a high-level examination of the residuals across states shows that Tier 3 estimates are generally consistent with observed radii (most residuals are near 0 mi, indicating strong agreement, consistent with the 1:1 predicted vs. observed plot above). Notable exceptions include MO and OK, which show systematic under-estimation (negative values indicate a smaller estimated radius than the observed radius) on the order of 3 miles.

# Tier 3: Modeled boundaries: read and construct geometry
t3 <- path(staging_path, "tier3_median.geojson") %>% st_read(quiet = TRUE)

# examine residuals
t3 <- t3 %>% filter(!is.na(radius)) %>% mutate(r = .pred - radius)

# plot
t3 %>% 
  mutate(state = str_sub(pwsid, 1, 2)) %>% 
  ggplot() +
  geom_vline(xintercept = 0, color = "black", linetype = "dashed") +
  geom_line(aes(r*km_to_mi, color = state), stat = "density") + 
  rcartocolor::scale_color_carto_d() +
  facet_wrap(~state, scales = "free_y") +
  guides(color = "none") +
  labs(x = "Residual (mi)", y = "Density") +
  theme_report
Fig 21. Spatial residuals (Tier 1 states)


Confidence intervals

Linear model 5% and 95% confidence intervals are calculated and provided as an output alongside the median radius. Importantly, the range of these predictions is relatively narrow, generally less than 0.6 mi.

t <- df %>% 
  select(pwsid, radius) %>% 
  bind_cols(predict(lm_fit, df)) %>% 
  bind_cols(predict(lm_fit, df, type = "conf_int", level = 0.95)) %>% 
  mutate(
    # exponentiate results back to median (unbiased), and 5/95 CIs
    across(where(is.numeric), ~10^(.x)),
    # calculate CI range to plot distribution
    .pred_range = .pred_upper - .pred_lower
  ) %>% 
  ggplot(aes((.pred_range)*km_to_mi)) +
  geom_histogram(bins = 100, color = "white") +
  #coord_cartesian(xlim = c(0, 2)) +
  labs(x = "CI range (mi)") +
  theme_report

Below is an example of one Tier 3 boundary with a typical CI range (around 0.6 mi). Toggle to the Esri.WorldImagery basemap to view satellite imagery of the underlying location and compare it to the estimated range of the Tier 3 estimates.

# lm CI layers
t3m_med <- st_read(path(staging_path, "tier3_median.geojson"), quiet = TRUE)
t3m_cil <- st_read(path(staging_path, "tier3_ci_lower_05.geojson"), quiet = TRUE)
t3m_ciu <- st_read(path(staging_path, "tier3_ci_upper_95.geojson"), quiet = TRUE)

# plot examples
pwsid_select <- "IN5289012" # ~1km CI range
mapview(filter(t3m_ciu, pwsid == pwsid_select), 
        col.regions = "green", layer.name = "upper CI", 
         alpha.regions = 0.3) +
mapview(filter(t3m_med, pwsid == pwsid_select), 
        col.regions = "blue", layer.name = "median",
        alpha.regions = 0.3) +
  mapview(filter(t3m_cil, pwsid == pwsid_select), 
          col.regions = "red", layer.name = "lower CI",
          alpha.regions = 0.3) 


Model Coefficients

In the linear model specified above, when both the dependent (response) variable and an independent (predictor) variable are log-transformed, we interpret the coefficient as an elasticity: the percent change in the response for every 1% change in the predictor. In our linear model, service connections are log-transformed, so we interpret the service_connections_count coefficient (0.42) as follows: for every 1% increase in service connection count, the water system radius increases by about 0.42%. Thus, a 25%, 50%, and 75% increase in service connection count is associated with roughly a 10%, 18%, and 26% increase in radius, respectively.
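These percentages can be reproduced from the fitted coefficient reported in Table 8 (illustrative arithmetic only):

```r
# with response and predictor both on the log10 scale, the coefficient acts
# as an elasticity: a p% change in service connections maps to a
# ((1 + p/100)^beta - 1) * 100 percent change in the fitted radius
beta <- 0.4154577
elasticity_pct <- function(p, b = beta) ((1 + p / 100)^b - 1) * 100

round(elasticity_pct(c(25, 50, 75)))  # 10 18 26
```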

It is also worth noting the high prevalence of large p-values, which indicate coefficients that cannot be statistically distinguished from zero9. Most interaction term coefficients fall into this category. By contrast, low p-value coefficients indicate predictors with a statistically meaningful association with the response variable.

lm_fit %>% 
  extract_fit_engine() %>% 
  tidy() %>% 
  knitr::kable(caption = "Table 8. Linear model coefficients for water system radii") %>% 
  kableExtra::kable_styling(full_width = FALSE)  
Table 8. Linear model coefficients for water system radii
term estimate std.error statistic p.value
(Intercept) 1.6711637 0.1571860 10.6317571 0.0000000
service_connections_count 0.4154577 0.0648016 6.4112272 0.0000000
owner_type_code_clean_L 0.2289093 0.1454270 1.5740492 0.1155032
owner_type_code_clean_M -0.1964070 0.1685112 -1.1655432 0.2438230
owner_type_code_clean_P -0.2828731 0.1452173 -1.9479303 0.0514473
owner_type_code_clean_S 0.4830355 0.1892295 2.5526435 0.0107035
satc_MU 0.4678033 0.1578150 2.9642516 0.0030403
satc_Other 0.1063745 0.0679815 1.5647553 0.1176674
satc_RA 0.2933555 0.0627314 4.6763740 0.0000030
is_wholesaler_ind_wholesaler 0.2790601 0.0481816 5.7918406 0.0000000
service_connections_count_x_owner_type_code_clean_L -0.0897143 0.0549454 -1.6327903 0.1025401
service_connections_count_x_owner_type_code_clean_M -0.0416230 0.0689349 -0.6038019 0.5459871
service_connections_count_x_owner_type_code_clean_P 0.0214787 0.0551942 0.3891472 0.6971744
service_connections_count_x_owner_type_code_clean_S -0.2653671 0.0731088 -3.6297545 0.0002849
service_connections_count_x_satc_MU 0.0037941 0.0571896 0.0663419 0.9471068
service_connections_count_x_satc_Other 0.1087154 0.0362928 2.9955093 0.0027456
service_connections_count_x_satc_RA 0.0626281 0.0349508 1.7918943 0.0731758
service_connections_count_x_is_wholesaler_ind_wholesaler -0.0926955 0.0155610 -5.9569194 0.0000000


Combine Tiers

After boundaries from all three Tiers are assimilated, matched, or modeled, they are combined into a final spatial data layer. Tier 1 boundaries always supersede Tier 2 boundaries, which always supersede Tier 3 boundaries, as depicted in the conceptual model in the Introduction. The combined TEMM data layer fulfills the motivation of this work, thus we characterize it in the Key Results.


Limitations

No model is without limitations. In this section, we highlight key limitations and identify potential future pathways to improve the model fit through additional data collection and/or cleaning.


Tier 2b Boundaries

Where multiple water systems match to the same TIGER boundary, their proxy boundaries perfectly overlap. Boundaries for these systems require further validation.


The “Pancake Problem”

# filter for pancake stacks
pancakes <- df %>% 
  filter(geometry_quality == "COUNTY CENTROID" & 
         has_labeled_bound == FALSE & 
         is.na(tiger_match_geoid)) 

# counties of big pancakes
big_pancakes <- pancakes %>% 
  count(county_served, primacy_agency_code, sort = TRUE) 

# plot
big_pancake_pwsid <- pancakes %>% 
  filter(county_served == big_pancakes$county_served[1] &
           primacy_agency_code == big_pancakes$primacy_agency_code[1]) %>% 
  pull(pwsid)

The “pancake problem” is fundamentally caused by low centroid accuracy (e.g., county or state centroid accuracy) and impacts PWSIDs that lack an explicit boundary (or TIGER match). Estimated radii for these systems result in circular buffers that overlap, like a stack of pancakes. To illustrate, we query for the biggest pancake stack (the greatest number of pancakes in a single stack), which belongs to the county of PUTNAM and has 78 Tier 3 predictions.
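A minimal sketch of the geometry involved (toy centroid and radii, not report data):

```r
library(sf)

# several systems sharing one county centroid yield concentric, fully
# overlapping circular buffers: a "pancake stack"
pts   <- st_sfc(st_point(c(0, 0)), st_point(c(0, 0)), st_point(c(0, 0)))
radii <- c(1, 2.5, 4)  # hypothetical Tier 3 radii (km)

pancakes <- st_buffer(pts, dist = radii)
lengths(st_intersects(pancakes))  # every buffer overlaps all three
```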

# visualize
temm %>% 
  filter(pwsid %in% big_pancake_pwsid) %>% 
  mapview(zcol = "pwsid")
# calculate count and proportion of pancakes
nrow_tier3 <- temm %>% filter(tier == "Tier 3") %>% nrow()

pancake_count_prop <- temm %>% 
  st_drop_geometry() %>% 
  filter(tier == "Tier 3") %>%
  group_by(geometry_quality) %>% 
  summarise(
    count = n(),
    percent_tier3 = round((count/nrow_tier3), 3)*100
  ) %>% 
  arrange(desc(percent_tier3)) 

Next, we quantify the count and percent of “geometry quality” values for all Tier 3 spatial boundaries. State centroid “geometry quality” (1.3% of Tier 3 data) is an extreme example of the centroid accuracy issues observed in the dataset. County centroids (50.2% of Tier 3 data) are better resolved than state-level centroids and are the most common scale at which pancake stacks form. Zip code centroids (20.6% of Tier 3 data) also form a large proportion of Tier 3 data, but pancakes at this scale are less inaccurate than those at the county or state scale and may represent acceptable error for most analytical purposes. Finally, not all “geometry quality” values are associated with pancakes (for example, “GPS - UNSPECIFIED” points, when plotted, clearly do not form pancake stacks), but these make up a small proportion of Tier 3 data. Most Tier 3 systems suffer from the pancake problem.

pancake_count_prop %>% 
  knitr::kable(col.names = c("Geometry Quality", 
                             "Count", 
                             "Percent"), 
               align = rep("l", 3),
               caption = "Table 9. Geometry quality of centroids for systems with Tier 3 boundaries") %>% 
  kableExtra::kable_styling(full_width = FALSE)  
Table 9. Geometry quality of centroids for systems with Tier 3 boundaries
Geometry Quality Count Percent
COUNTY CENTROID 5377 50.2
ZIP CODE CENTROID 2203 20.6
ADDRESS MATCHING-HOUSE NUMBER 1059 9.9
INTERPOLATION-PHOTO 880 8.2
MHP MATCH 370 3.5
UCMR CENTROID 355 3.3
STATE CENTROID 140 1.3
NA 71 0.7
GPS - UNSPECIFIED 69 0.6
POST OFFICE NAME CENTROID 57 0.5
INTERPOLATION-MAP 47 0.4
ADDRESS MATCHING-BLOCK FACE 26 0.2
ADDRESS MATCHING-NEAREST INTERSECTION 7 0.1
GPS CARRIER PHASE STATIC RELATIVE POSITION 9 0.1
GPS CODE (PSEUDO RANGE) DIFFERENTIAL 7 0.1
UNKNOWN 15 0.1
ADDRESS MATCHING-OTHER 2 0.0
ADDRESS MATCHING-STREET CENTERLINE 4 0.0
CLASSICAL SURVEYING TECHNIQUES 1 0.0
GPS CODE (PSEUDO RANGE) PRECISE POSITION 1 0.0
GPS CODE (PSEUDO RANGE) STANDARD POSITION (SA OFF) 1 0.0
INTERPOLATION - DIGITAL MAP SRCE (TIGER) 2 0.0
INTERPOLATION-OTHER 4 0.0
INTERPOLATION-SATELLITE 4 0.0
PLACE NAME CENTROID 5 0.0
PUBLIC LAND SURVEY-QUARTER SECTION 1 0.0
THE GEOGRAPHIC COORDINATE DETERMINATION METHOD BASED ON ADDRESS MATCHING 1 0.0
THE GEOGRAPHIC COORDINATE DETERMINATION METHOD BASED ON GPS 1 0.0
US BUREAU OF CENSUS BLOCK ESTABLISHED FOR YEAR NOTED 1 0.0

Upstream centroid accuracy issues are beyond the scope of this study, and may be addressed in coordination with federal agencies that produce FRS and ECHO databases. A recommendation for how to address the pancake problem is outlined in the Recommendations section below.


Native American data gaps

We lack labeled Tier 1 data, and hence radii, for tribal primacy types (and territories, although this analysis excludes them), and cannot improve the model in this respect until such data are collected. The model therefore has no information on “Tribal” primacy types, and all inferences are based on “State” primacy types. This may introduce error in modeled boundaries if system boundaries systematically vary with the “primacy type” parameter.

d %>% 
  group_by(primacy_type) %>% 
  # filter for Tier 1 systems
  filter(tier == "Tier 1") %>%
  summarise(proportion_present = round(sum(!is.na(radius)) / n(), 2)) %>% 
  knitr::kable(col.names = c("Primacy Type", "Proportion Present"),
               caption = "Table 10. Proportion of primacy type for water systems in Tier 1") %>% 
  kableExtra::kable_styling(full_width = FALSE)
Table 10. Proportion of primacy type for water systems in Tier 1
Primacy Type Proportion Present
State 1


Recommendations

Beyond a discussion of how the TEMM spatial layer may be used (covered in the next section, Contributions), we advise the following improvements to the pipeline for creating the TEMM spatial layer, in order of importance:

  1. Data collection and refinement
    • Collect additional Tier 1 data. Explicit boundary data supersedes Tier 2 TIGER proxy matches and Tier 3 modeled water system boundaries. Moreover, additional Tier 1 data improves Tier 3 model predictions by providing more training data for similar systems that lack boundary data or a matching TIGER place. In theory, explicit boundary data for each state across the USA would supersede all Tier 2 and 3 matches and estimates and reduce the pipeline presented herein to a simple data cleaning and assimilation task. Buffering state water line data to approximate water system boundaries is another option for states with this kind of data, e.g., Kentucky.
    • Gather labeled boundaries for tribal systems. Tribal systems (and territories, if coverage is to be extended to these regions) are currently underrepresented. Coordination should target the collection of Tier 1 data for these systems.
    • Gather additional data to support technical efforts. In particular, centroids of cities and towns and state-provided municipal boundaries may be used to address the pancake problem and improve the matching algorithm, respectively (described below). Another area for data improvement is to fill in data gaps around population or connections served that result in the loss of systems with missing data.
  2. Data pipeline and methodological improvements
    • Address the “Pancake Problem”. Low-accuracy centroids challenge the usefulness of the TEMM layer and introduce uncertainty into analyses that depend on these data. An approach to re-center low-accuracy centroids (i.e., state and county level centroid accuracy) may be to “spatially dis-aggregate” or “spread out” the pancakes to more reasonable locations, like the centroids of cities and towns that fall within the county/state (depending on the centroid accuracy). Candidate cities for pancakes may be identified via a name match.
    • Improve “Match” and “Model” Tiers 2-3. Improvements may be made to Tier 2 matching and Tier 3 modeling. Specifically, matching may be improved by using state-provided municipal boundaries, which Patterson et al. (2022) showed to improve upon Census Place boundaries. Addressing multiple systems matching to the same Census Place is another area for improvement, to reduce the number of systems with overlapping boundaries. A further match rule might be incorporated to assign only one system to the full TIGER boundary and apply the Tier 3 model to the remaining systems.
    • Develop rules to handle boundary assignment for county-level systems, private systems, or water districts, for which Tier 2 boundaries may be less appropriate according to Patterson et al. (2022); such efforts would require substantial work to first categorize the system types, since no database disaggregating public or private systems by more nuanced types readily exists.
    • Create tools to visualize matches. A preliminary, minimal interactive Shiny App was developed to visualize candidate matches and support Tier 2 algorithm development. This tool may be expanded to further support iterative review of changes to the algorithm and better understand how match rules determine the end result.
    • Web hosting of the TEMM spatial layer. The developed TEMM spatial layer will be made available online as a flat file (i.e., .geojson, .shp, .csv) which is accessible to analysts. To make the results accessible to non-technical people and the general public, one option is to stand it up on a server for interactive use in a web browser.
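A minimal sketch of the “spread out the pancakes” re-centering idea described above, assuming hypothetical columns `city_served`, `city_name`, `city_lat`, and `city_lon` (not the actual schema of the pipeline), might look like:

```r
library(dplyr)

# Hypothetical sketch: re-center county-centroid "pancake" systems on the
# centroid of a name-matched city/town, falling back to the original
# centroid when no name match exists. All column names here are assumptions.
respread_pancakes <- function(pancakes, city_points) {
  pancakes %>% 
    left_join(city_points, by = c("city_served" = "city_name")) %>% 
    mutate(
      lat = coalesce(city_lat, lat),  # prefer the matched city centroid
      lon = coalesce(city_lon, lon)
    ) %>% 
    select(-city_lat, -city_lon)
}
```

Candidate cities might additionally be filtered to the county/state implied by the centroid's “geometry quality” before the name match, so that a same-named city in another county cannot capture the system.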


Contributions

We developed a “Tiered Explicit, Match, and Model” (TEMM) approach to compile a nationwide water system boundary layer via an open source data pipeline. Such a dataset is unprecedented: presently, easily-accessible, machine-readable, and clean water system spatial boundary data – if available online – is siloed across states. This work brings together multiple spatial datasets where they exist, and develops algorithms and models to “fill in” missing data where it does not exist.

The spatial layer covers 44,919 community water systems and a total population of 306,876,850. Most people (282 million, 92.21% of the population) are covered by Tier 1 or Tier 2 spatial boundaries, which have relatively high accuracy.

The data pipeline developed in this work can dynamically regenerate the TEMM spatial layer based on the addition of new Tier 1 explicit boundaries, refinement of the Tier 2 matching algorithm, improvements to the Tier 3 modeled boundaries, and changes to any other data dependencies (e.g., improved centroid location to solve the pancake problem). This pipeline is provided under the MIT License online at an open source Github repository along with output TEMM spatial layer results in geojson, shapefile, and csv format. Contributor guidelines are available on Github, and detail the processes by which contributions may be made to the project.

We imagine the TEMM spatial layer evolving into an appropriate input for any analyses that depend on nationwide water service coverage boundaries.


Footnotes


  1. In a previous report (eda_february.html,“Section 6 TIGER proxy polygon appropriateness”), we show that matching PWSID to TIGER places is highly accurate: matched TIGER place polygons intersected explicit water service boundary polygons 93.16% of the time. In this report, we add more states, modify the matching algorithm, and further quantify this boolean (TRUE/FALSE) spatial intersection in terms of “percent overlap”. Thus, this older metric should be disregarded in light of new developments described herein.↩︎

  2. We drop 4,552/49,445 (9.2%) of water systems from this analysis for one of three reasons: (i) they lack a latitude or longitude (and hence we cannot “center” them anywhere), (ii) they fall outside of the 50 US States (e.g., territories like Puerto Rico, Guam, American Samoa), or (iii) their reported population or service connection count is less than 15, which is the minimum service connection count required for a system to be classified as a “community water system”.↩︎

  3. Some water systems were excluded from this data layer. Specifically, 4,526 water systems were dropped because they had fewer than 15 reported service connections, a population count less than 25, or an association with US territories; most water systems excluded in this step are found in Puerto Rico. Future efforts should refine service connection and population data for these systems.↩︎

  4. Labeled radii are calculated from the convex hull area of Tier 1 systems instead of simple water system area because they better represent calculated Tier 3 circular buffers.↩︎

  5. “Overfitting” is the tendency of a statistical or machine learning model to fit well to “seen” training data but generalize poorly to “unseen” testing data. We need the model to generalize well to unseen water systems. The linear model actually results in the lowest error of the models tested, which may at first seem surprising, but is not, considering the strong linear relationship between water system radius and service connections. One might expect machine learning approaches like Random Forest and xgboost to outperform the linear model. In fact, they do on training sets, but they tend to overfit the training set and do not generalize as well to test sets. Simply put, we do not have enough of the high-dimensional data that machine learning models require to outperform classical approaches like linear regression in the context of this problem.↩︎

  6. When only the dependent/response variable is log-transformed: exponentiate the coefficient, subtract one from this number, and multiply by 100. This gives the percent increase (or decrease) in the response for every one-unit increase in the independent variable. Example: the coefficient is 0.198. (exp(0.198) – 1) * 100 = 21.9. For every one-unit increase in the independent variable, our dependent variable increases by about 22%.↩︎

  7. Both dependent/response variable and independent/predictor variable(s) are log-transformed: interpret the coefficient as the percent increase in the dependent variable for every 1% increase in the independent variable. Example: the coefficient is 0.198. For every 1% increase in the independent variable, our dependent variable increases by about 0.20%. For an x percent increase, raise (1 + x/100) to the power of the coefficient, subtract 1, and multiply by 100. Example: for every 20% increase in the independent variable, our dependent variable increases by about (1.20^0.198 – 1) * 100 = 3.7 percent.↩︎
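The two coefficient interpretations in the footnotes above can each be checked with a line of R (using the coefficient 0.198 from the examples):

```r
b <- 0.198

# response log-transformed only: percent change per one-unit increase
(exp(b) - 1) * 100   # ≈ 21.9

# log-log: percent change in the response for a 20% increase in the predictor
(1.20^b - 1) * 100   # ≈ 3.7
```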

  8. There are 2 Tier 1 Native American systems, which heavily impacts modeling: meaningful models cannot be fit to data with only two observations. Thus, we re-code the 2 existing “N” systems to “M” to train the Tier 3 models. With the addition of an appreciable number (e.g., > 50) of labeled Tier 1 Native American boundaries, we can revert this re-coding step.↩︎

  9. The p-value in a linear regression tests the null hypothesis that the coefficient has no effect (i.e., coefficient = zero). A low p-value indicates evidence for rejecting the null hypothesis and a high p-value indicates otherwise.↩︎